Model Comparison Dashboard

Auto-refreshes every 30 seconds. Last updated: 2025-01-27 10:58:42

Executive Summary: Model Comparison Analysis
============================================
Models Compared: gpt-4o vs llama-3.3-70b-versatile
Sample Size: 100 questions

Original Text Performance
-------------------------
- Both models correct: 89 (89.0%)
- gpt-4o only correct: 6 (6.0%)
- llama-3.3-70b-versatile only correct: 2 (2.0%)
- Both incorrect: 2 (2.0%)

Misspelled Text Performance
---------------------------
- Both models correct: 59 (59.0%)
- gpt-4o only correct: 19 (19.0%)
- llama-3.3-70b-versatile only correct: 8 (8.0%)
- Both incorrect: 12 (12.0%)

Key Findings
------------
1. Accuracy: on original text both models answered correctly on 89.0% of questions; on misspelled text this dropped to 59.0%.
2. Robustness: llama-3.3-70b-versatile reached 67.0% accuracy on misspelled text, versus 78.0% for gpt-4o.
3. Reliability: both models completed their responses at similar rates.
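The per-model accuracies quoted in the findings follow from the pairwise counts above: a model's accuracy is the share of questions it got right alone plus those both models got right. A minimal sketch of that arithmetic (the function name is illustrative, not part of the dashboard's pipeline):

```python
def per_model_accuracy(both_correct: int, a_only: int, b_only: int, total: int):
    """Derive each model's accuracy (as a percentage) from pairwise counts.

    Model A's correct answers = both_correct + a_only; likewise for B.
    """
    return (100.0 * (both_correct + a_only) / total,
            100.0 * (both_correct + b_only) / total)

# Misspelled-text counts from the summary (A = gpt-4o, B = llama-3.3-70b-versatile):
gpt4o_acc, llama_acc = per_model_accuracy(59, 19, 8, 100)
print(gpt4o_acc, llama_acc)  # 78.0 67.0, matching the Robustness finding
```

The same formula applied to the original-text counts gives 95.0% for gpt-4o and 91.0% for llama-3.3-70b-versatile.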